MIC Known Issues

MPI Jobs Hanging

Description: Normal MPI jobs confirmed to hang on both BR and HB. The right number of processes start, but some of them go into Sleep state.

Diagnosis (10/21/13): The MIC installation adds an extra IB interface - the non-MIC nodes have only mlx4_0 while the MIC nodes also have scif0. MPI was defaulting to the latter.

Solution: Resolution to MPI hangs has been identified. Now we need to update the MPI installations to force MPI to use mlx4_0 by default.

Testing

Original Issue: A basic hello world cannot run using a two node job. It runs successfully with two processes within a node. There seems to be an issue with the fabric that prevents MPI_Init from executing correctly. Here's mvapich2:

    mpi_norm$ cat hf
    br002
    br002
    mpi_norm$ mpirun -np 2 -hostfile ./hf ./mpihw
    Hello from task 0 on br002!
    MASTER: Number of MPI tasks is: 2
    Hello from task 1 on br002!
    mpiexec@br002 HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
    mpiexec@br002 HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
    mpiexec@br002 HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
    mpiexec@br002 main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion
    mpi_norm$ vi hf
    mpi_norm$ cat hf
    br002
    br003
    mpi_norm$ mpirun -np 2 -hostfile ./hf ./mpihw
    Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
    Max MV2_SRQ_SIZE is 0, set to 4096
    Max MV2_DEFAULT_MAX_SG_LIST is 0, set to 1
    Max MV2_SRQ_SIZE is 0, set to 4096
    cli_0: aborting job:
    Fatal error in MPI_Init:
    Other MPI error, error stack:
    MPIR_Init_thread(436)....:
    MPID_Init(371)...........: channel initialization failed
    MPIDI_CH3_Init(292)......:
    MPIDI_CH3I_RDMA_init(368):
    rdma_iba_hca_init(871)...: Attributes failed sanity check
    cli_1: aborting job:
    Fatal error in MPI_Init:
    Other MPI error, error stack:
    MPIR_Init_thread(436)....:
    MPID_Init(371)...........: channel initialization failed
    MPIDI_CH3_Init(292)......:
    MPIDI_CH3I_RDMA_init(368):
    rdma_iba_hca_init(871)...: Attributes failed sanity check

And OpenMPI error messages were quite lengthy, but look something like this:

    hb001:65181 Signal: Segmentation fault (11)
    hb001:65181 Signal code: Address not mapped (1)
    hb001:65181 Failing at address: 0x58

Full error messages can be found in MM_OpenMPI.o40 (for Intel) and MM_OpenMPI.o42 (for GCC) in /home/jkrometi/mic/mpi_norm/

Update (10/21/13): Adding flags or environment variables to force MPI to use mlx4_0 fixes the errors. For mvapich2 set the MV2_IBA_HCA environment variable (or use the -env MV2_IBA_HCA=mlx4_0 flag):

    mpi_norm$ export MV2_IBA_HCA=mlx4_0
    mpi_norm$ mpiexec -np 2 -ppn 1 ./mpihw
    Hello from task 0 on br007!
    MASTER: Number of MPI tasks is: 2
    Hello from task 1 on br008!
    mpiexec@br007 HYDT_bscd_pbs_wait_for_completion (./tools/bootstrap/external/pbs_wait.c:68): tm_poll(obit_event) failed with TM error 17002
    mpiexec@br007 HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
    mpiexec@br007 HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:216): launcher returned error waiting for completion
    mpiexec@br007 main (./ui/mpich/mpiexec.c:325): process manager error waiting for completion

For OpenMPI set the OMPI_MCA_btl_openib_if_include environment variable (or use the --mca btl_openib_if_include mlx4_0:1 flag):

    mpi_norm$ module swap mvapich2 openmpi
    mpi_norm$ mpiexec -np 2 -hostfile hf --mca btl_openib_if_include mlx4_0:1 ./mpihw.ompi
    Hello from task 0 on br007!
    MASTER: Number of MPI tasks is: 2
    Hello from task 1 on br008!
    mpi_norm$ export OMPI_MCA_btl_openib_if_include=mlx4_0:1
    mpi_norm$ mpiexec -np 2 -hostfile hf ./mpihw.ompi
    Hello from task 0 on br007!
    MASTER: Number of MPI tasks is: 2
    Hello from task 1 on br008!

Both the mvapich2 and openmpi solutions have been tested on both the MIC and non-MIC nodes and via qsub.
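
As a stopgap in job scripts, the two settings tested above could be applied together. This is only a sketch: the variable names and the mlx4_0 / mlx4_0:1 values come from the tests above, while the function name pin_mpi_hca and its defaults are hypothetical and not part of any installed module.

```shell
# Hypothetical helper (name and defaults are illustrative, not an installed tool).
# Pins both MPI stacks to one IB device so neither picks scif0.
pin_mpi_hca() {
    # $1: HCA name, defaults to mlx4_0 (the device named in the notes above)
    # $2: port number, defaults to 1 (the port used in the OpenMPI tests above)
    export MV2_IBA_HCA="${1:-mlx4_0}"
    export OMPI_MCA_btl_openib_if_include="${1:-mlx4_0}:${2:-1}"
}

pin_mpi_hca
echo "MV2_IBA_HCA=$MV2_IBA_HCA"
echo "OMPI_MCA_btl_openib_if_include=$OMPI_MCA_btl_openib_if_include"
```

Calling pin_mpi_hca near the top of a qsub script, before mpiexec, should have the same effect as the export lines in the transcripts above.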
Now we need to update the installations to make sure these environment variables are set automatically.

Updates (10/24/13-11/3/2013): Tested the following apps with success on both MIC and non-MIC-enabled nodes using the flags above:
* Gromacs benchmark (/home/arcadm/gromacs/gromacs_mictest.sh) - seemed like a good test of an app built against our MPI stack.
* Abaqus multinode (modified version of the Ithaca example)
* I don't have permission to run Ansys, the other key application with its own MPI implementation, so Gene Cliff tested an Ansys Fluent run and it ran cleanly on the MIC nodes. So we should be okay there, too.
* Amit ran some tests using OpenMPI and those ran fine as well.
The changes to the MPI modules are probably ready to roll out to the cluster as a whole.

Update (11/4/2013): Justin updated the mvapich2 and openmpi spec files to set the environment variables identified above and submitted a Jira ticket (UAS-664) to get the appropriate changes made to the modulefiles.

MIC Libraries

Description: Manual copying of libraries to the MICs on BR is required to get basic jobs to run. For example, libiomp5.so needs to be copied to the MIC and added to LD_LIBRARY_PATH for OpenMP jobs to work natively. See Bharath's scripts for more.

Status: Resolved.

Updates

10/13/13: Original Issue: helloflops2 is a simple OpenMP program from the Jeffers/Reinders book on programming for the MIC:

    jeffers$ scp helloflops2 mic0:
    Warning: Permanently added 'mic0,10.11.110.1' (RSA) to the list of known hosts.
    helloflops2 100% 13KB 12.9KB/s 00:00
    jeffers$ ssh mic0
    Warning: Permanently added 'mic0,10.11.110.1' (RSA) to the list of known hosts.
    jkrometi$ ./helloflops2
    ./helloflops2: error while loading shared libraries: libiomp5.so: cannot open shared object file: No such file or directory

10/15/13: Justin checked on Stampede: when you SSH into a MIC, /opt/apps is surfaced and the only special environment variable is

    LD_LIBRARY_PATH=/opt/apps/intel/13/composer_xe_2013.2.146/compiler/lib/mic/

10/16/13: Chris and Justin decided to mimic what is seen on Stampede by surfacing /opt/apps and setting:

    LD_LIBRARY_PATH=/opt/apps/intel/13.1/compiler/lib/mic:/opt/apps/intel13_1/mkl/11/lib/intel64

10/21/13: Chris and Brandon worked to implement the solution above. This seems to have fixed the issue:

    omp$ ssh mic1
    Warning: Permanently added 'mic1,10.11.210.7' (RSA) to the list of known hosts.
    ~ $ cd mic/jeffers/
    ~/mic/jeffers $ ./helloflops2
    Initializing
    Starting Compute
    Gflops = 25.600, Secs = 1.531, GFlops per sec = 16.718

MIC Module Unloading

Description: BlueRidge has a mic module that sets a number of environment variables. However, when the module is unloaded, the environment variables are not unset correctly.

Status: Resolved 10/16/13 by Justin and Chris.

HB MIC Module

Description: BlueRidge has a mic module that sets a number of environment variables. Once the mkl and mic modules are loaded, MKL offloading seems to work. However, HoneyBadger does not have a mic module.

Solution: Create a mic module to set the same environment variables as the BlueRidge module does. (Note: The BR module was changed on 10/16/13 by Justin and Chris to fix an error with the unloading of the module.)

Testing Bharath's cblas_dgemm example yields:
* >400 GFlops on BR MIC-enabled nodes once the mic module has been loaded.
* >400 GFlops on HB once the above environment variables have been set.
* No more than 270 GFlops on BR MIC-enabled nodes when MKL_MIC_ENABLE is turned off.
* No more than 270 GFlops on HB when the above environment variables are not set.
* No more than 270 GFlops on normal (non-MIC) BR nodes.
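
The with/without-offload comparison in the list above could be scripted roughly as follows. This is a sketch under assumptions: MKL_MIC_ENABLE=1 is Intel MKL's automatic offload switch, but the full set of variables that the mic module exports is not listed here, so only that one variable is toggled. The function name run_mkl_case is hypothetical, and the echo command stands in for the real benchmark binary.

```shell
# Hypothetical wrapper: run a command with MKL automatic offload on or off.
# Each case runs in a subshell so the caller's environment is left untouched.
run_mkl_case() {
    case "$1" in
        offload) shift; ( export MKL_MIC_ENABLE=1; "$@" ) ;;  # offload requested
        host)    shift; ( unset MKL_MIC_ENABLE; "$@" ) ;;     # variable absent: host only
        *)       echo "usage: run_mkl_case offload|host command [args...]" >&2
                 return 2 ;;
    esac
}

# Stand-in command that just reports what the benchmark would see.
run_mkl_case offload sh -c 'echo "MKL_MIC_ENABLE=${MKL_MIC_ENABLE:-unset}"'
run_mkl_case host    sh -c 'echo "MKL_MIC_ENABLE=${MKL_MIC_ENABLE:-unset}"'
```

In practice the stand-in would be replaced by the cblas_dgemm test program, run once per mode, with the reported GFlops compared against the figures in the list.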

Offload Performance

Offload performance appears to be about half what it should be and half what is attained via native mode. Here's the performance of a simple example from Jeffers and Reinders' book on MIC programming on BlueRidge:

    jeffers$ MIC_KMP_AFFINITY=scatter ./helloflops3offload
    Initializing
    Starting Compute on 236 threads
    Using 236 threads...
    Gflops = 6041.600, Secs = 7.139, GFlops per sec = 846.243

Here's the performance of the same code on Stampede:

    c557-804$ MIC_KMP_AFFINITY=scatter ./helloflops3offload
    Initializing
    Starting Compute on 240 threads
    Using 240 threads...
    Gflops = 6144.000, Secs = 3.360, GFlops per sec = 1828.753

An almost identical code achieved 1786.6 GFlops when running in native mode on BlueRidge.

Note: Justin tried increasing the number of iterations to check whether this is overhead associated with offload jobs, such as copying arrays to the MIC. The performance did not improve, so this is likely not the case. The computations are simply being performed half as fast for some reason.

Dropped Packets

Description: This network issue is local to the MICs themselves. For some reason, there are tons of dropped packets on the MIC (167685 dropped against 212474 RX packets in the output below):

    mic0 Link encap:Ethernet HWaddr CA:51:7D:50:93:08
         inet addr:10.11.110.1 Bcast:0.0.0.0 Mask:255.255.0.0
         inet6 addr: fe80::c089:5bff:fef6:b637/64 Scope:Link
         UP BROADCAST RUNNING MTU:1500 Metric:1
         RX packets:212474 errors:0 dropped:167685 overruns:0 frame:0
         TX packets:864 errors:0 dropped:0 overruns:0 carrier:0
         collisions:0 txqueuelen:1000
         RX bytes:34786729 (33.1 MiB) TX bytes:133014 (129.8 KiB)

This is a bad enough problem that mounting /home would not work. We came up with a workaround: mount /home as a TCP mount, rather than the default UDP mount. TCP has built-in retransmission to deal with lost packets. Even though we now have a working /home inside the MICs, it is still WAY slow because so many packets have to be retransmitted.
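
To track whether the drop counters improve after a change, the dropped-to-RX ratio can be pulled out of ifconfig-style output like the block above. This is a rough sketch: the function name rx_drop_pct is hypothetical, and whether the RX packets counter already includes dropped packets depends on the driver, so the number is an indicator rather than an exact loss rate.

```shell
# Hypothetical helper: read ifconfig-style text on stdin and print
# dropped / RX packets as a percentage, one line per interface.
rx_drop_pct() {
    awk '/RX packets:/ {
        rx = 0; dropped = 0
        for (i = 1; i <= NF; i++) {
            split($i, kv, ":")                    # fields look like key:value
            if (kv[1] == "packets") rx = kv[2] + 0
            if (kv[1] == "dropped") dropped = kv[2] + 0
        }
        if (rx > 0) printf "%.1f\n", 100 * dropped / rx
    }'
}

# Counters copied from the mic0 output above.
rx_drop_pct <<'EOF'
RX packets:212474 errors:0 dropped:167685 overruns:0 frame:0
TX packets:864 errors:0 dropped:0 overruns:0 carrier:0
EOF
# prints 78.9
```

On a live node the same function could be fed directly, for example ifconfig mic0 | rx_drop_pct, assuming ifconfig prints the older key:value layout shown above.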
We are hoping that upgrading the mic environment to the latest available release will make this bug go away. Brandon has already downloaded the latest RPMs and put them onto the cluster. We just need some time to reprovision the nodes.